Data Description¶

This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains approximately 7.7 million accident records.

We will examine the car accidents covering 49 states of the USA for the years 2020 - 2022. You can access the original of the data from the link (https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents/data).

Importing Required Libraries¶

In [1]:
import numpy as np 
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
from datetime import datetime
import warnings 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder,LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

warnings.filterwarnings('ignore')

Data Handeling and Editing¶

In [43]:
carAccident = pd.read_csv("US_Accidents_March23.csv")
carAccident.head()
Out[43]:
ID Source Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) ... Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 A-1 Source2 3 2016-02-08 05:46:00 2016-02-08 11:00:00 39.865147 -84.058723 NaN NaN 0.01 ... False False False False False False Night Night Night Night
1 A-2 Source2 2 2016-02-08 06:07:59 2016-02-08 06:37:59 39.928059 -82.831184 NaN NaN 0.01 ... False False False False False False Night Night Night Day
2 A-3 Source2 2 2016-02-08 06:49:27 2016-02-08 07:19:27 39.063148 -84.032608 NaN NaN 0.01 ... False False False False True False Night Night Day Day
3 A-4 Source2 3 2016-02-08 07:23:34 2016-02-08 07:53:34 39.747753 -84.205582 NaN NaN 0.01 ... False False False False False False Night Day Day Day
4 A-5 Source2 2 2016-02-08 07:39:07 2016-02-08 08:09:07 39.627781 -84.188354 NaN NaN 0.01 ... False False False False True False Day Day Day Day

5 rows × 46 columns

In [44]:
carAccident['Year'] = carAccident['Start_Time'].apply(lambda x: x[:4])
carAccident['Month'] = carAccident['Start_Time'].apply(lambda x: x[5:7])
carAccident['Start_Hour'] = carAccident['Start_Time'].apply(lambda x: x[11:13])
carAccident.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7728394 entries, 0 to 7728393
Data columns (total 49 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   ID                     object 
 1   Source                 object 
 2   Severity               int64  
 3   Start_Time             object 
 4   End_Time               object 
 5   Start_Lat              float64
 6   Start_Lng              float64
 7   End_Lat                float64
 8   End_Lng                float64
 9   Distance(mi)           float64
 10  Description            object 
 11  Street                 object 
 12  City                   object 
 13  County                 object 
 14  State                  object 
 15  Zipcode                object 
 16  Country                object 
 17  Timezone               object 
 18  Airport_Code           object 
 19  Weather_Timestamp      object 
 20  Temperature(F)         float64
 21  Wind_Chill(F)          float64
 22  Humidity(%)            float64
 23  Pressure(in)           float64
 24  Visibility(mi)         float64
 25  Wind_Direction         object 
 26  Wind_Speed(mph)        float64
 27  Precipitation(in)      float64
 28  Weather_Condition      object 
 29  Amenity                bool   
 30  Bump                   bool   
 31  Crossing               bool   
 32  Give_Way               bool   
 33  Junction               bool   
 34  No_Exit                bool   
 35  Railway                bool   
 36  Roundabout             bool   
 37  Station                bool   
 38  Stop                   bool   
 39  Traffic_Calming        bool   
 40  Traffic_Signal         bool   
 41  Turning_Loop           bool   
 42  Sunrise_Sunset         object 
 43  Civil_Twilight         object 
 44  Nautical_Twilight      object 
 45  Astronomical_Twilight  object 
 46  Year                   object 
 47  Month                  object 
 48  Start_Hour             object 
dtypes: bool(13), float64(12), int64(1), object(23)
memory usage: 2.2+ GB

let's create the data set of the subject of the study¶

In [45]:
df = carAccident
df.drop(df[~df['Year'].isin(['2020', '2021', '2022'])].index, inplace = True)

# fix datetime type
df['Start_Time'] = pd.to_datetime(df['Start_Time'].str[:19])
df['End_Time'] = pd.to_datetime(df['End_Time'].str[:19])
df['Weather_Timestamp'] = pd.to_datetime(df['Weather_Timestamp'].str[:19])
df['Year'] = df['Start_Time'].dt.year
df['Month'] = df['Start_Time'].dt.month
df['Hour'] = df['Start_Time'].dt.hour

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4505118 entries, 512217 to 7246341
Data columns (total 50 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   ID                     object        
 1   Source                 object        
 2   Severity               int64         
 3   Start_Time             datetime64[ns]
 4   End_Time               datetime64[ns]
 5   Start_Lat              float64       
 6   Start_Lng              float64       
 7   End_Lat                float64       
 8   End_Lng                float64       
 9   Distance(mi)           float64       
 10  Description            object        
 11  Street                 object        
 12  City                   object        
 13  County                 object        
 14  State                  object        
 15  Zipcode                object        
 16  Country                object        
 17  Timezone               object        
 18  Airport_Code           object        
 19  Weather_Timestamp      datetime64[ns]
 20  Temperature(F)         float64       
 21  Wind_Chill(F)          float64       
 22  Humidity(%)            float64       
 23  Pressure(in)           float64       
 24  Visibility(mi)         float64       
 25  Wind_Direction         object        
 26  Wind_Speed(mph)        float64       
 27  Precipitation(in)      float64       
 28  Weather_Condition      object        
 29  Amenity                bool          
 30  Bump                   bool          
 31  Crossing               bool          
 32  Give_Way               bool          
 33  Junction               bool          
 34  No_Exit                bool          
 35  Railway                bool          
 36  Roundabout             bool          
 37  Station                bool          
 38  Stop                   bool          
 39  Traffic_Calming        bool          
 40  Traffic_Signal         bool          
 41  Turning_Loop           bool          
 42  Sunrise_Sunset         object        
 43  Civil_Twilight         object        
 44  Nautical_Twilight      object        
 45  Astronomical_Twilight  object        
 46  Year                   int32         
 47  Month                  int32         
 48  Start_Hour             object        
 49  Hour                   int32         
dtypes: bool(13), datetime64[ns](3), float64(12), int32(3), int64(1), object(18)
memory usage: 1.3+ GB

Calculate duration as the difference between end time and start time in minute¶

In [46]:
df['Duration'] = df.End_Time - df.Start_Time 
df['Duration'] = df['Duration'].apply(lambda x:round(x.total_seconds() / 3660) )
print("The overall mean duration is: ", (round(df['Duration'].mean(),3)), 'hours')
The overall mean duration is:  10.838 hours

Wind Direction & Weather Bins¶

In [47]:
# show distinctive weather conditions 
#Wind Direction Labeling
df.loc[df['Wind_Direction']=='Calm','Wind_Direction'] = 'CALM'
df.loc[(df['Wind_Direction']=='West')|(df['Wind_Direction']=='WSW')|(df['Wind_Direction']=='WNW'),'Wind_Direction'] = 'W'
df.loc[(df['Wind_Direction']=='South')|(df['Wind_Direction']=='SSW')|(df['Wind_Direction']=='SSE'),'Wind_Direction'] = 'S'
df.loc[(df['Wind_Direction']=='North')|(df['Wind_Direction']=='NNW')|(df['Wind_Direction']=='NNE'),'Wind_Direction'] = 'N'
df.loc[(df['Wind_Direction']=='East')|(df['Wind_Direction']=='ESE')|(df['Wind_Direction']=='ENE'),'Wind_Direction'] = 'E'
df.loc[df['Wind_Direction']=='Variable','Wind_Direction'] = 'VAR'


weather_bins = {
    'Clear': ['Clear', 'Fair'],
    'Cloudy': ['Cloudy', 'Mostly Cloudy', 'Partly Cloudy', 'Scattered Clouds'],
    'Rainy': ['Light Rain', 'Rain', 'Light Freezing Drizzle', 'Light Drizzle', 'Light Freezing Rain', 
              'Drizzle', 'Light Freezing Fog', 'Light Rain Showers', 'Showers in the Vicinity', 'T-Storm', 'Thunder', 
              'Patches of Fog', 'Funnel Cloud', 'Rain / Windy', 'Squalls', 'Thunder / Windy', 'Drizzle and Fog', 
              'T-Storm / Windy', 'Smoke / Windy', 'Haze / Windy', 'Light Drizzle / Windy', 'Widespread Dust / Windy', 
              'Wintry Mix', 'Wintry Mix / Windy', 'Light Snow with Thunder', 'Fog / Windy', 'Sleet / Windy', 
              'Squalls / Windy', 'Light Rain Shower / Windy', 'Light Sleet / Windy', 'Sand / Dust Whirlwinds', 
              'Mist / Windy', 'Drizzle / Windy', 'Duststorm', 'Sand / Dust Whirls Nearby', 'Thunder and Hail', 
              'Freezing Rain / Windy', 'Partial Fog', 'Thunder / Wintry Mix / Windy', 'Patches of Fog / Windy', 
              'Rain and Sleet', 'Partial Fog / Windy', 'Sand / Dust Whirlwinds / Windy', 'Light Hail', 'Light Thunderstorm', 
              'Rain Shower / Windy', 'Sleet and Thunder', 'Drifting Snow / Windy', 'Shallow Fog / Windy', 
              'Thunder and Hail / Windy', 'Heavy Sleet / Windy', 'Sand / Windy', 'Blowing Sand', 'Drifting Snow'],
    'Heavy_Rainy': ['Heavy Rain', 'Heavy T-Storm', 'Heavy Thunderstorms and Rain', 'Heavy T-Storm / Windy', 
                    'Heavy Rain / Windy', 'Heavy Ice Pellets', 'Heavy Freezing Rain / Windy', 'Heavy Freezing Drizzle', 
                    'Heavy Rain Showers', 'Heavy Sleet and Thunder', 'Heavy Rain Shower / Windy','Heavy Rain Shower',
                    'Heavy Thunderstorms with Small Hail'],
    'Snowy': ['Light Snow', 'Snow', 'Light Snow / Windy', 'Snow Grains', 'Snow Showers', 'Snow / Windy', 
              'Light Snow and Sleet', 'Snow and Sleet', 'Light Snow and Sleet / Windy', 'Snow and Sleet / Windy', 
              'Heavy Thunderstorms and Snow', 'Snow and Thunder / Windy', 'Snow and Thunder', 'Light Snow Shower / Windy',
              'Light Snow Grains', 'Heavy Snow with Thunder', 'Heavy Blowing Snow', 'Low Drifting Snow', 
              'Thunderstorms and Snow', 'Blowing Snow Nearby', 'Light Blowing Snow'],
    'Windy': ['Blowing Dust / Windy', 'Fair / Windy', 'Mostly Cloudy / Windy', 'Light Rain / Windy', 'T-Storm / Windy', 
              'Blowing Snow / Windy', 'Freezing Rain / Windy', 'Light Snow and Sleet / Windy', 'Sleet and Thunder / Windy', 
              'Blowing Snow Nearby', 'Heavy Rain Shower / Windy'],
    'Hail': ['Hail'],
    'Volcanic Ash': ['Volcanic Ash'],
    'Tornado': ['Tornado']
}

def map_weather_to_bins(weather):
    for bin_name, bin_values in weather_bins.items():
        if weather in bin_values:
            return bin_name
    return 'Other'

df['Weather_Bin'] = df['Weather_Condition'].apply(map_weather_to_bins)
df.head()
Out[47]:
ID Source Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) ... Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight Year Month Start_Hour Hour Duration Weather_Bin
512217 A-512230 Source2 1 2022-09-08 05:49:30 2022-09-08 06:34:53 41.946796 -88.208092 NaN NaN 0.00 ... Night Night Day Day 2022 9 05 5 1 Clear
512218 A-512231 Source2 1 2022-09-08 02:02:05 2022-09-08 04:31:32 34.521172 -117.958076 NaN NaN 0.00 ... Night Night Night Night 2022 9 02 2 2 Clear
512219 A-512232 Source2 1 2022-09-08 05:14:12 2022-09-08 07:38:17 37.542839 -77.441780 NaN NaN 0.00 ... Night Night Night Night 2022 9 05 5 2 Cloudy
512220 A-512233 Source2 1 2022-09-08 06:22:57 2022-09-08 06:52:42 40.896629 -81.178452 NaN NaN 0.00 ... Night Night Day Day 2022 9 06 6 0 Cloudy
512221 A-512234 Source2 2 2022-09-08 06:36:20 2022-09-08 07:05:58 41.409359 -81.644318 NaN NaN 1.91 ... Night Day Day Day 2022 9 06 6 0 Cloudy

5 rows × 52 columns

Exploratory Data Analysis¶

Statistical Description of each numerical column¶

In [48]:
df.describe().T
Out[48]:
count mean min 25% 50% 75% max std
Severity 4505118.0 2.123295 1.0 2.0 2.0 2.0 4.0 0.429284
Start_Time 4505118 2021-08-25 18:45:36.072808448 2020-01-01 00:01:00 2020-12-23 11:14:00 2021-09-27 12:46:04.500000 2022-04-30 21:56:14.249999872 2022-12-31 23:59:03 NaN
End_Time 4505118 2021-08-26 05:51:08.807245056 2020-01-01 00:34:29 2020-12-23 19:02:43.500000 2021-09-28 01:13:51 2022-05-01 16:09:32.249999872 2023-03-31 23:59:00 NaN
Start_Lat 4505118.0 35.982806 24.5548 33.027574 35.782829 39.962899 49.000504 5.177597
Start_Lng 4505118.0 -94.077159 -124.548074 -117.137228 -86.685037 -80.216664 -67.48413 17.430231
End_Lat 3348607.0 35.945393 24.566013 32.935342 35.87789 39.969813 49.002223 5.312601
End_Lng 3348607.0 -94.678953 -124.545748 -117.409456 -86.703416 -80.173215 -67.48413 17.891797
Distance(mi) 4505118.0 0.724965 0.0 0.0 0.135 0.749 441.75 1.870308
Weather_Timestamp 4426814 2021-08-25 21:05:40.420118528 2020-01-01 00:12:00 2020-12-23 18:53:00 2021-09-27 19:48:30 2022-04-30 19:52:00 2022-12-31 23:56:00 NaN
Temperature(F) 4403414.0 61.959102 -89.0 50.0 64.0 77.0 203.0 19.067869
Wind_Chill(F) 4368280.0 60.740209 -89.0 50.0 64.0 76.0 196.0 21.182693
Humidity(%) 4396807.0 64.248426 1.0 48.0 66.0 84.0 100.0 22.979141
Pressure(in) 4418158.0 29.363386 0.0 29.18 29.7 29.96 58.63 1.095945
Visibility(mi) 4400499.0 9.085471 0.0 10.0 10.0 10.0 140.0 2.517412
Wind_Speed(mph) 4382734.0 7.319508 0.0 3.0 7.0 10.0 1087.0 5.530296
Precipitation(in) 4312635.0 0.005723 0.0 0.0 0.0 0.0 36.47 0.052758
Year 4505118.0 2021.129528 2020.0 2020.0 2021.0 2022.0 2022.0 0.797569
Month 4505118.0 6.755552 1.0 4.0 7.0 10.0 12.0 3.640925
Hour 4505118.0 12.4661 0.0 8.0 13.0 17.0 23.0 5.683172
Duration 4505118.0 10.838266 0.0 1.0 1.0 2.0 25889.0 277.696796

Statistical Description of each catogorial columns¶

In [49]:
df.select_dtypes(include = ['object','bool']).describe().T
Out[49]:
count unique top freq
ID 4505118 4505118 A-512230 1
Source 4505118 3 Source1 3348607
Description 4505114 2266825 A crash has occurred causing no to minimum del... 9593
Street 4495025 262146 I-95 S 47576
City 4504950 12339 Miami 150492
County 4505118 1776 Los Angeles 279370
State 4505118 49 CA 1003321
Zipcode 4504077 570034 33186 7270
Country 4505118 1 US 4505118
Timezone 4500637 4 US/Eastern 2197472
Airport_Code 4488918 1988 KCQT 64788
Wind_Direction 4382691 10 CALM 784869
Weather_Condition 4404143 109 Fair 2128073
Amenity 4505118 2 False 4452904
Bump 4505118 2 False 4502759
Crossing 4505118 2 False 4037038
Give_Way 4505118 2 False 4486759
Junction 4505118 2 False 4201423
No_Exit 4505118 2 False 4493668
Railway 4505118 2 False 4468606
Roundabout 4505118 2 False 4504983
Station 4505118 2 False 4385179
Stop 4505118 2 False 4381382
Traffic_Calming 4505118 2 False 4500392
Traffic_Signal 4505118 2 False 3972257
Turning_Loop 4505118 1 False 4505118
Sunrise_Sunset 4483654 2 Day 2991746
Civil_Twilight 4483654 2 Day 3190208
Nautical_Twilight 4483654 2 Day 3406089
Astronomical_Twilight 4483654 2 Day 3583656
Start_Hour 4505118 24 16 351260
Weather_Bin 4505118 9 Clear 2128073

The 20 states with the highest number of accidents¶

In [50]:
states_by_accident = df.County.value_counts()
top20_state = states_by_accident.head(20)

sns.barplot(y=top20_state.keys(),x=top20_state.values)
plt.title('The 20 states with the highest number of accidents')
plt.tight_layout()
No description has been provided for this image
In [51]:
#Los Angeles County has the highest number of accidents by a significant margin, indicating it is a major hotspot for accidents.

Weather condition Crashes counts : graph¶

In [52]:
fig, ax = plt.subplots(figsize = (7.5,5))
c = sns.countplot(x="Year", data=df, orient = 'v', palette = "crest_r")
c.set_title("Counts of Accidents in Year")
for i in ax.patches:
    count = '{:,.0f}'.format(i.get_height())
    x = i.get_x()+i.get_width()-0.60
    y = i.get_height()+10000
    ax.annotate(count, (x, y))
plt.show()
No description has been provided for this image
In [53]:
#2020: The lower number of accidents in 2020 could be attributed to the COVID-19 pandemic, where lockdowns and reduced travel may have resulted in fewer accidents.
#2021 and 2022: The increase in accidents in these years could be due to the gradual return to normalcy, with more vehicles on the road and higher traffic volumes as restrictions were lifted.

Average Duration Of Accidents by Severity¶

In [54]:
avg_time = df.groupby('Severity')['Duration'].mean()
# Plot the results
plt.figure(figsize=(8, 5))
avg_time.plot(kind='bar', color='skyblue')
plt.xlabel('Severity')
plt.ylabel('Average Duration (hours)')
plt.title('Average Duration of Accidents by Severity')
plt.xticks(rotation=0)
for index, value in enumerate(avg_time):
    plt.text(index, value, f'{value:.2f}', ha='center', va='bottom')
plt.show()
No description has been provided for this image
In [55]:
#Severity 1: The average duration of accidents with severity 1 is approximately 0.79 hours (around 47 minutes). These are likely minor accidents with quick resolution times.
#Severity 2: Accidents with severity 2 have an average duration of 11.27 hours. This significant increase suggests that severity 2 accidents are more serious and require more time for resolution, possibly involving injuries or more substantial vehicle damage.
#Severity 3: The average duration for severity 3 accidents is around 0.96 hours (approximately 58 minutes), indicating they are slightly more severe than level 1 but still resolved relatively quickly.
#Severity 4: The most severe accidents (severity 4) have an average duration of 39.58 hours, which is a substantial duration. These accidents are likely very serious, involving significant road blockages, severe injuries, or fatalities, requiring extensive time for investigation and cleanup.

¶

The number of accidents by state codes.¶

In [56]:
fig, ax = plt.subplots(figsize = (15,5))
c = sns.countplot(x="State", data=df, orient = '', palette = "crest_r", order = df['State'].value_counts().index)
c.set_title("States with No. of Accidents");
No description has been provided for this image
In [57]:
#California (CA) has the highest number of accidents, approaching 1 million.
#Florida (FL) follows with a substantial number of accidents, significantly lower than California but still notably high.
#The distribution of accidents across states varies widely, with some states showing very high numbers while others have relatively fewer accidents.

Top 50 Cities with Highest No. of Accidents¶

In [58]:
fig, ax = plt.subplots(figsize = (15,5))
c = sns.countplot(x="City", data=df, order=df.City.value_counts().iloc[:50].index, orient = 'v', palette = "crest_r")
c.set_title("Top 50 Cities with Highest No. of Accidents")
c.set_xticklabels(c.get_xticklabels(), rotation=90)
plt.show()
No description has been provided for this image
In [ ]:
#The cities with the highest number of accidents are major metropolitan areas, such as Miami, Los Angeles, and Orlando. This makes sense due to their large populations and heavy traffic.

Accident cases for different weather conditions occur in US¶

In [60]:
plt.figure(figsize=(10,5))
sns.barplot(x=df['Weather_Bin'].value_counts().iloc[:50], y=df['Weather_Bin'].value_counts().iloc[:50].index)
plt.title(" Accident cases for different weather conditions occur in US ",size=17,color="grey")
plt.xlabel('Weather Condition')
plt.ylabel('No. of accidents')
plt.show()
No description has been provided for this image
In [61]:
#Clear weather conditions account for the highest number of accidents, with over 2 million cases. This indicates that most accidents occur under clear weather conditions.
#Cloudy weather is the second most common condition associated with accidents, with around 1.5 million cases.

The time period with the most accidents¶

In [62]:
fig, ax = plt.subplots(figsize = (10,5))
sns.countplot(x="Hour", data=df, orient = 'v', palette = "icefire_r")
plt.annotate('Morning Peak',xy=(6,350000), fontsize=12)
plt.annotate('Evening Peak',xy=(15,350000), fontsize=12)
plt.annotate('go to work',xy=(8,0),xytext=(0,95000),arrowprops={'arrowstyle':'-|>'}, fontsize=12)
plt.annotate('get off work',xy=(17,0),xytext=(19,95000),arrowprops={'arrowstyle':'-|>'}, fontsize=12)
plt.title('The time period with the most accidents')
plt.show()
No description has been provided for this image
In [63]:
#The number of accidents increases significantly starting from around 5 AM, peaking at 8 AM.
#This peak corresponds to the morning rush hour when many people are commuting to work or school. Increased traffic volume during this time likely contributes to the higher number of accidents.
In [64]:
data = df

data['Start_Time'] = pd.to_datetime(data['Start_Time'])


data['Month'] = data['Start_Time'].dt.to_period('M')
monthly_accidents = data.groupby('Month').size().reset_index(name='Accidents')
monthly_accidents['Month'] = monthly_accidents['Month'].dt.to_timestamp()


fig = px.line(monthly_accidents, x='Month', y='Accidents', title='Monthly Number of Accidents',
              labels={'Month': 'Month', 'Accidents': 'Number of Accidents'},
              template='plotly_dark')


fig.update_traces(line_color='cyan', line_width=2)
fig.update_layout(title_font_size=24, title_x=0.5)

fig.show()
In [65]:
#There appear to be periodic peaks and troughs, suggesting possible seasonal effects on the number of accidents. 
#For instance, certain months may see higher accident rates due to adverse weather conditions, holidays, or other events that increase traffic volume.

Machine Learning & Predict¶

Data Cleaning¶

In [66]:
#Features 'ID' doesn't provide any useful information about accidents themselves. 'TMC', 'Distance(mi)', 'End_Time' (we have start time), 'Duration', 'End_Lat', and 'End_Lng'(we have start location) 
#can be collected only after the accident has already happened and hence cannot be predictors for serious accident prediction.
In [67]:
df = df.drop(['ID','Start_Time',	'Start_Lat','Start_Lng','Description','Distance(mi)', 'End_Time', 'Duration', 
              'End_Lat', 'End_Lng','Weather_Timestamp'], axis=1)
In [68]:
#categorical columns 
In [69]:
cat_names = [ 'Country', 'Timezone', 'Amenity', 'Bump', 'Crossing', 
             'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 
             'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop', 'Sunrise_Sunset', 
             'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight']
print("Unique count of categorical features:")
for i in cat_names:
  print(i,df[i].unique().size)
Unique count of categorical features:
Country 1
Timezone 5
Amenity 2
Bump 2
Crossing 2
Give_Way 2
Junction 2
No_Exit 2
Railway 2
Roundabout 2
Station 2
Stop 2
Traffic_Calming 2
Traffic_Signal 2
Turning_Loop 1
Sunrise_Sunset 3
Civil_Twilight 3
Nautical_Twilight 3
Astronomical_Twilight 3
In [70]:
#Drop 'Country' and 'Turning_Loop' for they have only one class. 
In [71]:
df = df.drop(['Country','Turning_Loop'], axis=1)

Correlations¶

In [72]:
num_corr = df.select_dtypes(include = ['float64','int64']).corr()
sns.heatmap(num_corr)
Out[72]:
<Axes: >
No description has been provided for this image
In [ ]:
#The 'Severity' of incidents shows some positive correlation with 'Precipitation(in)' and 'Temperature(F)', although these correlations are not very strong.
#There appears to be a weak negative correlation with 'Visibility(mi)', suggesting that lower visibility might be associated with higher severity, but the effect is not very pronounced.

Calculating Cramer's statistics for categorical columns¶

In [21]:
import scipy.stats as stats
df['Severity_Label'] = df['Severity'].apply(lambda x: 'Very High' if x == 4 else ('High' if x == 3 else ('Medium' if x == 2 else 'Low')))
categorical_features = ['Severity_Label','Street','Weather_Bin','State']

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))


results = pd.DataFrame(index=categorical_features, columns=categorical_features)

for i in range(len(categorical_features)):
    for j in range(len(categorical_features)):
        if i == j:
            results.iloc[i, j] = np.nan  # Diagonal
        else:
            results.iloc[i, j] = cramers_v(df[categorical_features[i]], df[categorical_features[j]])

results = results.astype(float)
print("Cramér's V Matrix:")
print(results)

plt.figure(figsize=(10, 8))
sns.heatmap(results, annot=True, cmap='viridis', cbar=True)
plt.title("Cramér's V Heatmap for Categorical Variables")
plt.show()
Cramér's V Matrix:
                Severity_Label    Street  Weather_Bin     State
Severity_Label             NaN  0.350105     0.033211  0.173223
Street                0.350105       NaN     0.232840  0.748451
Weather_Bin           0.033211  0.232840          NaN  0.124383
State                 0.173223  0.748451     0.124383       NaN
No description has been provided for this image
In [22]:
#There is a moderate association between the severity label of accidents and the street where the accidents occurred.
#This indicates that certain streets might have a higher tendency to experience accidents of particular severities. 

Models & Predict¶

Handling Missing Data¶

In [75]:
print(df[['Street', 'State', 'City', 'Weather_Bin', 'Hour', 'Severity']].isnull().sum())
missing = pd.DataFrame(df[['Street', 'State', 'City', 'Weather_Bin', 'Hour', 'Severity']].isnull().sum()).reset_index()
missing.columns = ['Feature', 'Missing_Percent(%)']
missing['Missing_Percent(%)'] = missing['Missing_Percent(%)'].apply(lambda x: x / df.shape[0] * 100)
missing.loc[missing['Missing_Percent(%)']>0,:].sort_values(by = 'Missing_Percent(%)')
Street         10093
State              0
City             168
Weather_Bin        0
Hour               0
Severity           0
dtype: int64
Out[75]:
Feature Missing_Percent(%)
2 City 0.003729
0 Street 0.224034
In [ ]:
#Imputing City and Street columns with KNN Imputer Method
In [ ]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

label_encoders = {}
for column in ['City', 'Street']:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column].astype(str))  # Convert to string to handle NaNs
    label_encoders[column] = le

# Apply KNNImputer to the relevant columns
imputer = KNNImputer(n_neighbors=5)
df[['City', 'Street']] = imputer.fit_transform(df[['City', 'Street']])

# Inverse transform the encoded columns back to original categorical format
for column in ['City', 'Street']:
    df[column] = label_encoders[column].inverse_transform(df[column].round().astype(int))
In [82]:
df[['City','Street']].isnull().sum()
Out[82]:
City      0
Street    0
dtype: int64
In [83]:
df = df.sample(n=500000, random_state=42)
#Due to hardware limitations, model validation techniques such as cross-validation could not be performed. Therefore, models were evaluated directly using the training and test sets.
In [89]:
# Select the relevant columns
model_df = df[['Street', 'State', 'City', 'Weather_Bin', 'Hour', 'Severity']]


X = model_df.drop('Severity', axis=1)
y = model_df['Severity']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [90]:
X_train.shape
Out[90]:
(350000, 5)
In [91]:
X_test.shape
Out[91]:
(150000, 5)
In [145]:
#Encoding
In [124]:
#Due to the variables in the table having more than one hundred categories, label encoding was used instead of the one-hot encoding module.
In [100]:
#label encoding
for column in ['Street', 'State', 'City', 'Weather_Bin']:
    le = LabelEncoder()
    X_train[column] = le.fit_transform(X_train[column])
    label_encoders[column] = le


for column in ['Street', 'State', 'City', 'Weather_Bin']:
    le = LabelEncoder()
    X_test[column] = le.fit_transform(X_test[column])
    label_encoders[column] = le
In [101]:
X_test
Out[101]:
Street State City Weather_Bin Hour
5453590 24831 8 5243 1 17
7022755 23024 6 1043 0 13
4041934 30251 38 2155 0 18
5641001 36858 3 2012 0 15
4802439 16128 29 4916 0 12
... ... ... ... ... ...
6430936 16577 38 1001 0 13
6824164 277 16 5713 0 22
1045717 29597 32 1548 0 1
4589986 42146 25 2182 0 16
867607 26627 41 7022 1 16

150000 rows × 5 columns

In [102]:
X_train
Out[102]:
Street State City Weather_Bin Hour
5895872 42118 8 6381 0 15
5669769 61759 32 7766 0 13
6549343 326 40 3042 1 21
4748814 27839 43 8194 0 15
6918065 30669 22 5183 1 15
... ... ... ... ... ...
4324204 56910 3 5839 0 13
594911 19609 18 6312 1 8
5149718 48787 41 4337 1 17
3981790 21639 3 7115 0 14
1531252 12454 21 6943 1 17

350000 rows × 5 columns

In [ ]:
#Logistic Regression
In [126]:
log_model = LogisticRegression(max_iter=1000).fit(X_train,y_train)
y_pred = log_model.predict(X_test)
In [127]:
logistic_accuracy = accuracy_score(y_test, y_pred)
In [128]:
logistic_accuracy
Out[128]:
0.8720666666666667
In [118]:
#Decision Tree Classifier
In [129]:
Dcf_model = DecisionTreeClassifier().fit(X_train,y_train)
y_pred = Dcf_model.predict(X_test)
In [130]:
Dcf_accuracy = accuracy_score(y_test, y_pred)
Dcf_accuracy
Out[130]:
0.7687666666666667
In [131]:
# Random Forest Classsifier
In [144]:
Rf_model = RandomForestClassifier().fit(X_train,y_train)
y_pred = Rf_model.predict(X_test)
Rf_accuracy = accuracy_score(y_test, y_pred)
Rf_accuracy
Out[144]:
0.8672733333333333
In [143]:
models = {'Logistic Regression Classifier':LogisticRegression(),
         'Decision Tree Classifier' :DecisionTreeClassifier() ,
         'Random Forest Classifier':RandomForestClassifier()}

results = []
for model_name, model in models.items():
    # Fit the model
    model.fit(X_train, y_train)
    
    # Predict and calculate accuracy
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Append the result
    results.append((model_name, accuracy))

# Convert results to DataFrame and display
results_df = pd.DataFrame(results, columns=['Model', 'Accuracy'])

# Find the best model
best_model_name = results_df.loc[results_df['Accuracy'].idxmax(), 'Model']
best_model_score = results_df['Accuracy'].max()

print(f'The best model is: {best_model_name} with an accuracy of {best_model_score}')

# Plot the results
plt.figure(figsize=(10, 6))
plt.bar(results_df['Model'], results_df['Accuracy'], color='skyblue')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Model Comparison')
plt.ylim(0, 1)
plt.show()
The best model is: Logistic Regression Classifier with an accuracy of 0.8720666666666667
No description has been provided for this image
In [ ]:
#Accuracy: The Logistic Regression Classifier achieved the highest accuracy (approximately 87.2%).
#Logistic Regression is performing better than both Decision Tree and Random Forest Classifiers.

Cluster Analysis¶

In [152]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
model_df.reset_index(drop=True,inplace = True)
kdf = model_df.copy()


for column in ['Street', 'State', 'City', 'Weather_Bin']:
    lekmeans = LabelEncoder()
    kdf[column] = lekmeans.fit_transform(kdf[column])
    label_encoders[column] = lekmeans
    
    
kmeans = KMeans()
visu = KElbowVisualizer(kmeans, k = (2,20))
visu.fit(kdf)
visu.poof()
No description has been provided for this image
Out[152]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
In [ ]:
#Now We use the best cluster number
In [154]:
kmeans = KMeans(n_clusters = 5).fit(kdf)
kmeans
Out[154]:
KMeans(n_clusters=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=5)
In [174]:
clusters = kmeans.labels_
Severity_clusters = pd.DataFrame({"Severity": kdf.Severity, "Cluster": clusters})
State_clusters = pd.DataFrame({"State": model_df.State, "Cluster": clusters})
In [179]:
kdf['Cluster Numbers '] = clusters
kdf
Out[179]:
Street State City Weather_Bin Hour Severity Cluster Numbers
0 419 3 1359 0 15 2 2
1 62003 3 1604 1 15 2 3
2 45170 8 9068 4 7 2 0
3 34088 41 5212 1 15 3 0
4 33893 41 5212 1 17 3 0
... ... ... ... ... ... ... ...
499995 31577 41 3133 1 17 3 0
499996 33731 8 412 0 17 2 0
499997 34119 33 4726 1 15 3 0
499998 33561 3 7011 4 8 2 0
499999 86526 36 7927 4 2 2 1

500000 rows × 7 columns

SUMMARY¶

  • The dataset mentioned in the above link was reduced to cover car accidents in 49 states of the USA from 2020 to 2022. Data cleaning was performed, and the analysis began.
  • Descriptive statistics provided an overview of the data, and exploratory data analysis methods were used to generate relevant graphs. It was observed that accident rates increased by 49.5% between 2020 and 2022. Most accidents occurred in clear weather, and the three states with the highest number of accidents were Los Angeles, Miami-Dade, and Orange, respectively.
  • The model was evaluated using three different classification models (Logistic Regression, Decision Tree, and Random Forest). The best classification model was determined to be Logistic Regression with an accuracy of 87.2%.
  • As an unsupervised learning model, the KMeans clustering algorithm was used, and the data was labeled into 5 clusters.

Created BY¶

Ayşegül ÜLKER